@codeflash-ai codeflash-ai bot commented Oct 22, 2025

📄 7% (0.07x) speedup for find_migrations in chromadb/db/migrations.py

⏱️ Runtime: 3.05 milliseconds → 2.85 milliseconds (best of 51 runs)
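
As a quick check on the headline figure, the reported percentages are consistent with dividing the time saved by the optimized runtime. The formula below is inferred from the numbers in this report and is an assumption, not something taken from Codeflash documentation; `speedup` is a hypothetical helper name.

```python
def speedup(original: float, optimized: float) -> float:
    """Relative speedup: time saved divided by the optimized runtime."""
    return (original - optimized) / optimized


# Overall runtime from the summary line (milliseconds).
print(f"{speedup(3.05, 2.85):.1%}")

# One per-test pair from the generated tests (microseconds).
print(f"{speedup(16.8, 14.6):.1%}")
```

Under this reading, 3.05 ms → 2.85 ms works out to about 7%, matching the title line.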

📝 Explanation and details

The optimized code achieves a roughly 7% speedup (3.05 ms → 2.85 ms) through two key algorithmic improvements:

1. Single-pass filtering in find_migrations:

  • Original: Creates intermediate lists with list comprehension, then filters by scope, then sorts
  • Optimized: Combines parsing and scope filtering in a single loop, avoiding redundant iterations and memory allocations
  • Impact: Eliminates the filter() call that was processing all files again after parsing (1.9% of total time in profiler)

2. Streamlined file validation in _read_migration_file:

  • Original: Checks both `"path" not in file` and `not file["path"].is_file()`
  • Optimized: Extracts `path = file["path"]` once and only checks `not path.is_file()`
  • Impact: Reduces dictionary lookups and simplifies the conditional logic
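
The validation change in item 2 can be sketched as follows. This is a minimal illustration, not the code from `chromadb/db/migrations.py`: `FakePath`, `check_original`, and `check_optimized` are hypothetical names, and the error messages are placeholders.

```python
from typing import Any, Dict


class FakePath:
    """Hypothetical stand-in for a Traversable; only is_file() is modelled."""

    def __init__(self, is_file: bool) -> None:
        self._is_file = is_file

    def is_file(self) -> bool:
        return self._is_file


def check_original(file: Dict[str, Any]) -> None:
    """Original shape: a membership test followed by a second dict lookup."""
    if "path" not in file or not file["path"].is_file():
        raise FileNotFoundError(file.get("filename", "<unknown>"))


def check_optimized(file: Dict[str, Any]) -> None:
    """Optimized shape: one lookup, then a single is_file() call."""
    path = file["path"]  # raises KeyError if the key is absent
    if not path.is_file():
        raise FileNotFoundError(file.get("filename", "<unknown>"))
```

The two forms behave identically whenever `"path"` is present. If the key is missing, the original raises `FileNotFoundError` while the optimized form raises `KeyError`, so the change is only behavior-preserving if every entry reaching this check carries a `"path"` key. That is presumably the case for entries built from directory iteration, but it may be worth confirming in review.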

Performance characteristics by test case:

  • Small datasets (1-10 files): roughly 10-25% improvement due to reduced overhead
  • Large datasets (100+ files): 6-12% improvement, showing the optimization scales well
  • Empty directories: 68% improvement due to eliminated intermediate list creation
  • Mixed scope filtering: Particularly effective since scope filtering happens during parsing rather than as a separate pass

The optimizations are most effective when processing directories with many non-matching files or mixed scopes, as the single-pass approach avoids building and then filtering large intermediate collections.
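
The single-pass idea can be tried in isolation with a sketch like the one below. It is a simplified model that works on filenames only: the regex, `parse`, `two_pass`, and `single_pass` are assumptions made for illustration and are not copied from chromadb. Timings will vary by machine, so none are quoted here.

```python
import re
import timeit
from typing import Any, Dict, List, Sequence

# Assumed filename layout "<version>-<name>.<scope>.sql"; illustrative only.
FILENAME_RE = re.compile(r"(\d+)-(.+)\.(.+)\.sql")


def parse(name: str) -> Dict[str, Any]:
    """Split a migration filename into its version and scope."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"invalid migration filename: {name}")
    version, _, scope = match.groups()
    return {"filename": name, "version": int(version), "scope": scope}


def two_pass(names: Sequence[str], scope: str) -> List[Dict[str, Any]]:
    """Parse every .sql name, filter the parsed list by scope, then sort."""
    parsed = [parse(n) for n in names if n.endswith(".sql")]
    matching = list(filter(lambda f: f["scope"] == scope, parsed))
    return sorted(matching, key=lambda f: f["version"])


def single_pass(names: Sequence[str], scope: str) -> List[Dict[str, Any]]:
    """Parse and filter in one loop, so no unfiltered list is ever built."""
    matching: List[Dict[str, Any]] = []
    for n in names:
        if not n.endswith(".sql"):
            continue
        entry = parse(n)
        if entry["scope"] == scope:
            matching.append(entry)
    matching.sort(key=lambda f: f["version"])
    return matching


if __name__ == "__main__":
    # Mixed scopes: half of the parsed entries are discarded by the filter.
    names = [
        f"{i:05d}-users.{scope}.sql"
        for i in range(1, 101)
        for scope in ("sqlite", "postgres")
    ]
    assert two_pass(names, "sqlite") == single_pass(names, "sqlite")
    for fn in (two_pass, single_pass):
        elapsed = timeit.timeit(lambda: fn(names, "sqlite"), number=200)
        print(f"{fn.__name__}: {elapsed:.4f}s for 200 calls")
```

Both functions still parse every `.sql` filename, since the scope is only known after parsing. The saving comes from skipping the intermediate list and the extra `filter()` pass, which fits the modest gains reported above.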

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 14 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
import hashlib
import re
import sys
from typing import Any, Dict, List

# imports
import pytest
from chromadb.db.migrations import find_migrations


# --- Begin: Minimal stubs for types and exceptions used in the function ---
class InvalidMigrationFilename(Exception):
    pass

class InvalidHashError(Exception):
    def __init__(self, alg):
        super().__init__(f"Invalid hash algorithm: {alg}")
Migration = Dict[str, Any]

# --- End: Minimal stubs ---

# --- Begin: Minimal Traversable mock for testing ---
class MockTraversable:
    """A minimal Traversable mock for testing."""
    def __init__(self, name, is_file=True, text=None):
        self.name = name
        self._is_file = is_file
        self._text = text

    def is_file(self):
        return self._is_file

    def read_text(self):
        if not self._is_file:
            raise FileNotFoundError(f"{self.name} is not a file")
        return self._text if self._text is not None else ""

class MockDir:
    """A minimal directory mock, with .name and .iterdir()."""
    def __init__(self, name, files: List[MockTraversable]):
        self.name = name
        self._files = files

    def iterdir(self):
        return iter(self._files)

# --- Begin: Unit tests ---
# Basic Test Cases

def test_single_migration_basic_md5():
    # One migration file, correct format, md5 hash
    sql_text = "CREATE TABLE test (id INT);"
    file = MockTraversable("00001-users.sqlite.sql", is_file=True, text=sql_text)
    dir = MockDir("migrations", [file])
    codeflash_output = find_migrations(dir, "sqlite", "md5"); migrations = codeflash_output # 16.8μs -> 14.6μs (15.1% faster)
    m = migrations[0]
    # Check hash correctness
    expected_hash = (
        hashlib.md5(sql_text.encode("utf-8"), usedforsecurity=False).hexdigest()
        if sys.version_info >= (3, 9)
        else hashlib.md5(sql_text.encode("utf-8")).hexdigest()
    )

def test_single_migration_basic_sha256():
    # One migration file, correct format, sha256 hash
    sql_text = "CREATE TABLE test (id INT);"
    file = MockTraversable("00001-users.sqlite.sql", is_file=True, text=sql_text)
    dir = MockDir("migrations", [file])
    codeflash_output = find_migrations(dir, "sqlite", "sha256"); migrations = codeflash_output # 11.3μs -> 10.0μs (13.1% faster)
    m = migrations[0]
    expected_hash = hashlib.sha256(sql_text.encode("utf-8")).hexdigest()

def test_multiple_migrations_sorted_and_filtered():
    # Multiple migration files, mixed scopes, ensure sorting and filtering
    files = [
        MockTraversable("00002-users.sqlite.sql", is_file=True, text="ALTER TABLE users ADD COLUMN age INT;"),
        MockTraversable("00001-users.sqlite.sql", is_file=True, text="CREATE TABLE users (id INT);"),
        MockTraversable("00001-users.postgres.sql", is_file=True, text="CREATE TABLE users (id SERIAL);"),
        MockTraversable("00003-users.sqlite.sql", is_file=True, text="DROP TABLE users;"),
    ]
    dir = MockDir("migrations", files)
    codeflash_output = find_migrations(dir, "sqlite", "md5"); migrations = codeflash_output # 18.0μs -> 16.4μs (9.74% faster)

def test_no_matching_scope_returns_empty():
    # No migrations matching the scope
    files = [
        MockTraversable("00001-users.postgres.sql", is_file=True, text="CREATE TABLE users (id SERIAL);"),
    ]
    dir = MockDir("migrations", files)
    codeflash_output = find_migrations(dir, "sqlite", "md5"); migrations = codeflash_output # 6.14μs -> 4.90μs (25.3% faster)

def test_non_sql_files_are_ignored():
    # Only .sql files are considered
    files = [
        MockTraversable("00001-users.sqlite.txt", is_file=True, text="not sql"),
        MockTraversable("00001-users.sqlite.sql", is_file=True, text="CREATE TABLE users (id INT);"),
        MockTraversable("README.md", is_file=True, text="readme"),
    ]
    dir = MockDir("migrations", files)
    codeflash_output = find_migrations(dir, "sqlite", "md5"); migrations = codeflash_output # 12.1μs -> 10.5μs (15.1% faster)

# Edge Test Cases


def test_missing_file_raises():
    # File is not a file (is_file returns False)
    files = [
        MockTraversable("00001-users.sqlite.sql", is_file=False, text="CREATE TABLE users (id INT);"),
    ]
    dir = MockDir("migrations", files)
    with pytest.raises(FileNotFoundError):
        find_migrations(dir, "sqlite", "md5") # 10.8μs -> 8.92μs (21.4% faster)


def test_empty_directory_returns_empty():
    # No files in directory
    dir = MockDir("migrations", [])
    codeflash_output = find_migrations(dir, "sqlite", "md5"); migrations = codeflash_output # 3.98μs -> 2.38μs (67.6% faster)

def test_duplicate_versions_are_sorted():
    # Multiple migrations with same version number, should sort and include both
    files = [
        MockTraversable("00001-users.sqlite.sql", is_file=True, text="CREATE TABLE users (id INT);"),
        MockTraversable("00001-accounts.sqlite.sql", is_file=True, text="CREATE TABLE accounts (id INT);"),
    ]
    dir = MockDir("migrations", files)
    codeflash_output = find_migrations(dir, "sqlite", "md5"); migrations = codeflash_output # 19.3μs -> 17.5μs (10.1% faster)

def test_sql_file_with_empty_content_hashes_correctly():
    # SQL file with empty content
    files = [
        MockTraversable("00001-users.sqlite.sql", is_file=True, text=""),
    ]
    dir = MockDir("migrations", files)
    codeflash_output = find_migrations(dir, "sqlite", "md5"); migrations = codeflash_output # 11.0μs -> 9.51μs (15.5% faster)
    expected_hash = (
        hashlib.md5(b"", usedforsecurity=False).hexdigest()
        if sys.version_info >= (3, 9)
        else hashlib.md5(b"").hexdigest()
    )

def test_filename_with_leading_zeros_and_large_version():
    # File with large version number and leading zeros
    files = [
        MockTraversable("000123-users.sqlite.sql", is_file=True, text="SELECT 1;"),
    ]
    dir = MockDir("migrations", files)
    codeflash_output = find_migrations(dir, "sqlite", "md5"); migrations = codeflash_output # 10.5μs -> 9.13μs (14.8% faster)

def test_filename_with_extra_dots_in_name():
    # Filename with extra dots in the "name" part (should still parse correctly)
    files = [
        MockTraversable("00001-users.v2.sqlite.sql", is_file=True, text="SELECT 1;"),
    ]
    dir = MockDir("migrations", files)
    codeflash_output = find_migrations(dir, "sqlite", "md5"); migrations = codeflash_output # 10.4μs -> 9.07μs (14.8% faster)

# Large Scale Test Cases

def test_large_number_of_migrations_sorted_and_filtered():
    # 100 migrations, mixed scopes, ensure sorting and filtering
    files = []
    for i in range(1, 101):
        files.append(MockTraversable(f"{i:05d}-users.sqlite.sql", is_file=True, text=f"-- Migration {i}"))
        files.append(MockTraversable(f"{i:05d}-users.postgres.sql", is_file=True, text=f"-- Migration {i}"))
    dir = MockDir("migrations", files)
    codeflash_output = find_migrations(dir, "sqlite", "md5"); migrations = codeflash_output # 273μs -> 244μs (11.7% faster)
    # Correct sql content
    for i, m in enumerate(migrations, start=1):
        pass

def test_large_number_of_migrations_with_duplicates():
    # 500 migrations, with duplicate versions
    files = []
    for i in range(1, 251):
        files.append(MockTraversable(f"{i:05d}-users.sqlite.sql", is_file=True, text=f"-- Migration {i}"))
        files.append(MockTraversable(f"{i:05d}-accounts.sqlite.sql", is_file=True, text=f"-- Migration {i}"))
    dir = MockDir("migrations", files)
    codeflash_output = find_migrations(dir, "sqlite", "md5"); migrations = codeflash_output # 903μs -> 848μs (6.47% faster)
    # All version numbers from 1 to 250, each appearing twice
    versions = [m["version"] for m in migrations]
    for v in range(1, 251):
        pass

def test_performance_large_scale():
    # 999 migrations, ensure function completes and returns correct result
    files = []
    for i in range(1, 1000):
        files.append(MockTraversable(f"{i:05d}-users.sqlite.sql", is_file=True, text=f"-- Migration {i}"))
    dir = MockDir("migrations", files)
    codeflash_output = find_migrations(dir, "sqlite", "md5"); migrations = codeflash_output # 1.74ms -> 1.64ms (5.92% faster)
    # Spot check a few hashes
    for idx in [0, 499, 998]:
        sql_text = f"-- Migration {idx+1}"
        expected_hash = (
            hashlib.md5(sql_text.encode("utf-8"), usedforsecurity=False).hexdigest()
            if sys.version_info >= (3, 9)
            else hashlib.md5(sql_text.encode("utf-8")).hexdigest()
        )
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import hashlib
import re
import sys
import types
from typing import List, Sequence

# imports
import pytest
from chromadb.db.migrations import find_migrations
from importlib_resources.abc import Traversable

Migration = dict

class InvalidMigrationFilename(Exception):
    pass

class InvalidHashError(Exception):
    def __init__(self, alg):
        super().__init__(f"Invalid hash algorithm: {alg}")

# --- Test helpers ---

class FakeFile(Traversable):
    """A fake file object for testing."""
    def __init__(self, name, content, is_file=True):
        self.name = name
        self._content = content
        self._is_file = is_file

    def is_file(self):
        return self._is_file

    def read_text(self):
        return self._content

    # Traversable interface requires the following, but not used here
    def iterdir(self):
        raise NotImplementedError()

    def joinpath(self, child):
        raise NotImplementedError()

    def open(self, mode='r', *args, **kwargs):
        raise NotImplementedError()

    def __truediv__(self, child):
        raise NotImplementedError()

    def exists(self):
        return True

class FakeDir(Traversable):
    """A fake directory object for testing."""
    def __init__(self, name, files: List[FakeFile]):
        self.name = name
        self._files = files

    def iterdir(self):
        return iter(self._files)

    # Traversable interface requires the following, but not used here
    def joinpath(self, child):
        raise NotImplementedError()

    def is_file(self):
        return False

    def open(self, mode='r', *args, **kwargs):
        raise NotImplementedError()

    def __truediv__(self, child):
        raise NotImplementedError()

    def exists(self):
        return True

    def read_text(self):
        raise NotImplementedError()

# --- Unit tests ---

# 1. Basic Test Cases

#------------------------------------------------
from chromadb.db.migrations import find_migrations

To edit these changes, run `git checkout codeflash/optimize-find_migrations-mh1pdzko` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 22, 2025 08:00
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 22, 2025